We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. Building on the techniques of Audibert and Bubeck [2009] and Russo and Van Roy [2013], we first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by $14 \sqrt{n K}$. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by $\frac{1}{20} \sqrt{n K}$. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known, as well as a lower bound on the smallest gap), and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.
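For readers unfamiliar with the algorithm under study, the following is a minimal sketch (not the paper's analysis or code) of Thompson Sampling for Bernoulli bandits with independent Beta(1, 1) priors on each arm's mean; the function and argument names are illustrative assumptions.

```python
import random

def thompson_sampling(arms, n):
    """Run Thompson Sampling for n rounds on Bernoulli arms.

    arms: list of callables, each returning a {0, 1} reward when pulled
         (illustrative interface, not from the paper).
    Each arm's mean gets a Beta(1, 1) prior; after observing rewards the
    posterior is Beta(1 + successes, 1 + failures).
    """
    K = len(arms)
    alpha = [1] * K  # posterior successes + 1, per arm
    beta = [1] * K   # posterior failures + 1, per arm
    total_reward = 0
    for _ in range(n):
        # Sample a plausible mean from each arm's posterior,
        # then play the arm whose sample is largest.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
        i = max(range(K), key=lambda j: samples[j])
        reward = arms[i]()
        alpha[i] += reward
        beta[i] += 1 - reward
        total_reward += reward
    return total_reward
```

The posterior sampling step is what distinguishes Thompson Sampling from optimistic index policies: exploration arises because suboptimal arms with uncertain posteriors occasionally produce the largest sample.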